Programming environment

Download raw data

This study training data

Table S1. Raw data list
name size modified_date id
Preprocess.R 2.6000e+03 06/02/2024 10:43 PM syn60236613
Sample_annotation.csv 8.9500e+04 05/30/2024 8:37 AM syn60157686
Probe_array.csv 7.3000e+07 05/30/2024 9:04 AM syn60157718
Probe_annotation.csv 5.6870e+08 05/30/2024 8:49 AM syn60157694
DetectionP_subchallenge1.csv 2.7340e+09 05/24/2024 5:35 AM syn59870646
DetectionP_subchallenge2.csv 4.9870e+09 05/24/2024 5:51 AM syn59872208
Beta_raw_subchallenge1.csv.gz 5.8690e+09 05/24/2024 5:19 AM syn59868755
Beta_raw_subchallenge2.csv.gz 1.1008e+10 05/24/2024 2:19 PM syn59898399

Download original series for understanding the data generation process

Load raw data

Table S3. List of training data in this study
GEO Number N
GSE128827 5
GSE228149 5
GSE200659 11
E_MTAB_9312 13
GSE74738 13
GSE108567 16
GSE75196 24
GSE115508 25
GSE98224 48
GSE69502 52
GSE204977 55
GSE169598 64
GSE100197 95
GSE232778 187
GSE144129 210
GSE167885 242
GSE75248 334
GSE71678 343

Methylation data preprocessing

Probe filtering

Probe normalization

Phenotype preprocessing

Phenotype preprocessing for conditions

Phenotype imputation for conditions

FGR imputation

PE imputation

PE onset imputation

HELLP imputation

Diandric triploid imputation

Miscarriage imputation

Preterm imputation

GDM imputation

SGA imputation

LGA imputation

SLGA imputation

IVF imputation

Subfertility imputation

Chorioamnionitis imputation

Apply imputation

Phenotype preprocessing for non-conditions

Predictive modeling

Phenotype correlation matrix among conditions and gestational age

Table S7. Categorical variables with pair-wise perfect separation.
V1 V2
anencephaly diandric_triploid
anencephaly hellp
anencephaly ivf
anencephaly lga
anencephaly miscarriage
anencephaly spina_bifida
anencephaly subfertility
diandric_triploid chorioamnionitis
diandric_triploid ivf
diandric_triploid lga
diandric_triploid sga
diandric_triploid subfertility
fgr anencephaly
fgr chorioamnionitis
fgr ivf
fgr spina_bifida
fgr subfertility
hellp chorioamnionitis
ivf chorioamnionitis
ivf hellp
ivf subfertility
miscarriage chorioamnionitis
miscarriage ivf
miscarriage lga
miscarriage sga
miscarriage subfertility
ms_ivf anencephaly
ms_ivf diandric_triploid
ms_ivf ivf
ms_ivf miscarriage
ms_ivf spina_bifida
ms_ivf subfertility
ms_subfertility hellp
ms_subfertility ivf
ms_subfertility subfertility
pe anencephaly
pe diandric_triploid
pe hellp
pe pe_onset
pe spina_bifida
pe_onset anencephaly
pe_onset chorioamnionitis
pe_onset diandric_triploid
pe_onset hellp
pe_onset ivf
pe_onset miscarriage
pe_onset spina_bifida
pe_onset subfertility
preterm diandric_triploid
preterm ivf
preterm miscarriage
preterm subfertility
sga hellp
sga ivf
sga lga
sga subfertility
spina_bifida diandric_triploid
spina_bifida hellp
spina_bifida ivf
spina_bifida miscarriage
spina_bifida subfertility
subfertility chorioamnionitis
subfertility hellp
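The pairs in Table S7 could be found by cross-tabulating every pair of categorical variables. The sketch below is a minimal illustration that assumes "perfect separation" means at least one empty cell in the cross-tabulation of two variables; the source does not state this definition. The function names and input layout are hypothetical.

```python
from itertools import combinations, product


def has_empty_cell(xs, ys):
    """Return True if some combination of observed levels never co-occurs.

    Assumption: this is what 'perfect separation' refers to; missing values
    (None) are dropped pair-wise before cross-tabulating.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    x_levels = {x for x, _ in pairs}
    y_levels = {y for _, y in pairs}
    observed = set(pairs)
    return any(cell not in observed for cell in product(x_levels, y_levels))


def separated_pairs(table):
    """table: dict mapping a variable name to a list of values (equal lengths)."""
    return [
        (a, b)
        for a, b in combinations(sorted(table), 2)
        if has_empty_cell(table[a], table[b])
    ]
```

Pairs flagged in this way cannot both be used as free predictors in a standard logistic model without special handling, which is presumably why they are listed.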

[Per-condition output omitted in this rendering. List elements, first set: fgr, pe, pe_onset, preterm, anencephaly, spina_bifida, gdm, diandric_triploid, miscarriage, lga, subfertility, hellp, chorioamnionitis. List elements, second set: ivf, subfertility.]

Model development

GA prediction

The Normal-GA model was trained using samples that had none of 12 of the 13 available conditions. All 13 conditions were significantly correlated with GA: (1) fetal growth restriction (FGR); (2) preeclampsia (PE); (3) PE onset (early/late/not applicable); (4) hemolysis, elevated liver enzymes, and low platelets (HELLP) syndrome; (5) anencephaly; (6) spina bifida; (7) diandric triploid; (8) miscarriage; (9) preterm delivery; (10) gestational diabetes mellitus (GDM); (11) large-for-gestational-age (LGA) infant; (12) subfertility; and (13) chorioamnionitis. Preterm delivery was the one condition not used as an exclusion criterion, because it is related to the outcome, i.e., GA, simply by definition.
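The sample selection described above can be sketched as a simple filter. This is a minimal illustration, not the study's code: the column names follow the variable names in Table S7, and the 0/1 coding (with 0 standing for "not applicable" in pe_onset) is an assumption.

```python
# The 12 exclusion conditions: all 13 listed conditions minus preterm delivery.
# Column names and the 0 = absent coding are assumptions for illustration.
EXCLUSION_CONDITIONS = [
    "fgr", "pe", "pe_onset", "hellp", "anencephaly", "spina_bifida",
    "diandric_triploid", "miscarriage", "gdm", "lga", "subfertility",
    "chorioamnionitis",
]


def is_condition_free(sample):
    """Return True only when every exclusion condition is recorded as absent.

    A missing value is treated as 'not known to be absent', so the sample
    fails the check.
    """
    return all(sample.get(c) == 0 for c in EXCLUSION_CONDITIONS)


def select_normal_ga_training(samples):
    """Keep the samples eligible for training the Normal-GA model."""
    return [s for s in samples if is_condition_free(s)]
```

Note that a sample with preterm delivery still passes the filter, which matches the exclusion of preterm delivery from the criteria.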

FGR prediction

PE prediction

PE onset prediction

HELLP prediction

Anencephaly prediction

Spina bifida prediction

Diandric triploid prediction

Miscarriage prediction

Preterm prediction

GDM prediction

LGA prediction

Subfertility prediction

Chorioamnionitis prediction

GA res-true-conds prediction

GA res-true-conds elastic net

GA res-true-conds random forest

GA residual prediction

GA residual elastic net

GA residual random forest

GA res-comp-risk prediction

GA res-comp-risk elastic net

GA res-comp-risk random forest

GA res-full prediction

GA res-full elastic net

GA res-full random forest

GA res-seq-risk prediction

GA res-seq-risk elastic net

GA res-seq-risk random forest

GA res-cpg prediction

The Res-CPG-GA model consisted of two models, one for <37 and one for ≥37 weeks’ gestation as estimated by the Normal-GA model. The number of models and the periods were determined from clinical knowledge and with the aim of obtaining normally distributed residual GA. Pregnancy termination at <37 weeks’ gestation is presumably related to a medical indication, whereas during term pregnancy both medical and non-medical indications may be encountered. Fitting residuals in term pregnancy using beta values might therefore lead to overfitting. Nonetheless, we used two approaches, i.e., predicting residual GA during: (1) <37 weeks’ gestation only (Res-CPG-PR-GA); and (2) both <37 and ≥37 weeks’ gestation (Res-CPG-GA). Training of the latter model was restricted to samples with absolute residual GA >0.05. This threshold was chosen by visual inspection of the quantile-quantile (QQ) plot to pursue linearity, so that the residuals could be fitted easily by elastic net regression. We assumed that this approach would avoid overfitting on predicted GA that was already well estimated by the Normal-GA model. We tested this assumption by training the Res-CPG model without the residual-GA restriction (Res-CPG-rev-GA).
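The period split and the residual restriction can be sketched as follows. This is an illustration under stated assumptions: residual GA is taken as true GA minus the Normal-GA estimate (the sign convention is not given in the text), and the function name and input layout are hypothetical.

```python
TERM_THRESHOLD_WEEKS = 37.0  # boundary between the two Res-CPG sub-models
MIN_ABS_RESIDUAL = 0.05      # restriction used for Res-CPG-GA training


def split_res_cpg_training(samples, restrict=True):
    """Group training samples by the period estimated by the Normal-GA model.

    samples: iterable of (true_ga, normal_ga_estimate) pairs.
    restrict=True mimics Res-CPG-GA (keep only |residual| > 0.05);
    restrict=False mimics Res-CPG-rev-GA (no residual restriction).
    """
    groups = {"lt37": [], "ge37": []}
    for true_ga, est_ga in samples:
        residual = true_ga - est_ga  # assumed sign convention
        if restrict and abs(residual) <= MIN_ABS_RESIDUAL:
            continue
        key = "lt37" if est_ga < TERM_THRESHOLD_WEEKS else "ge37"
        groups[key].append((est_ga, residual))
    return groups
```

Under this sketch, Res-CPG-PR-GA would use only the "lt37" group, while Res-CPG-GA and Res-CPG-rev-GA would fit one model per group.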

GA res-conds prediction

The Res-Conds-GA model was similar to the Resfull-GA model, but its predictors were the products of each condition’s predicted probability and the residual GA estimated by a model for the corresponding condition. Specifically, for each condition we trained a model using beta values of DMPs among samples with that condition. The rationale was that the conditions differ in when pregnancies are terminated, and each pregnant woman has a different set of probabilities for the conditions. For comparison, we used the true probabilities and the predicted probabilities of the conditions, the latter from either the prediction models (Resfull-GA) or the imputation models (Res-GA), and we also ran the model without the multiplication procedure (all probabilities equal to 1). This comparison tested how much the phenotype prediction performance matters for the accuracy and robustness of the parent model.
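The multiplication procedure can be illustrated with a small helper. This is a sketch only: the function name and the dictionary layout are hypothetical, and the values in the usage note are made up for illustration.

```python
def res_conds_features(probabilities, condition_residuals):
    """Build one Res-Conds-style predictor per condition.

    probabilities: dict of condition -> probability of that condition
        (true 0/1 values, predicted values, or all 1.0 to skip the weighting).
    condition_residuals: dict of condition -> residual GA estimated by the
        model trained for that condition.
    """
    if set(probabilities) != set(condition_residuals):
        raise ValueError("both inputs must cover the same conditions")
    return {
        c: probabilities[c] * condition_residuals[c]
        for c in sorted(probabilities)
    }
```

For example, a sample with probability 0.8 for pe and a pe-specific residual estimate of -2.0 contributes a predictor of -1.6, whereas setting every probability to 1 passes the condition-specific residuals through unchanged.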

GA res-comb prediction

The Res-Comb- and Res-CPG-Comb-GA models were considered because conditions other than the 12 above might also affect pregnancy termination. The two models differ in stacking order. In the Res-Comb-GA model, we limited the degrees of freedom of the residual fitting using known phenotype information (Res-Conds-GA), so that the second model (Res-CPG-GA) only fitted the unexplained residual GA, simply to boost the prediction. In the Res-CPG-Comb-GA model, by contrast, the Res-CPG-GA model was assumed to fit both explained and unexplained residual GA in general. Since the resulting residual was presumably smaller, the overall prediction was less affected by the imperfect accuracy of Res-Conds-GA. Both the Res-Comb- and Res-CPG-Comb-GA models consisted of three models, for <37, ≥37 and ≤40, and >40 weeks’ gestation as estimated by the Normal-GA model. The number of models and the periods were again determined from clinical knowledge and with the aim of obtaining normally distributed residual GA. The estimated delivery date falls at 40 weeks’ gestation. Before this date, a pregnant woman might seek termination in advance because of a medical condition, whereas a woman with a normal pregnancy might seek termination from the estimated delivery date onward. We used three approaches, i.e., predicting residual GA during: (1) <37 weeks’ gestation only (Res-Comb-PR-GA); (2) both <37 weeks’ and ≥37 to ≤40 weeks’ gestation, i.e., including term before the estimated delivery date (Res-Comb-PRTB-GA); and (3) all three periods (Res-Comb-GA).
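The three periods and the three approaches can be written down as a small lookup. This is a sketch: the period labels and function names are hypothetical, while the thresholds and variant names come from the paragraph above.

```python
def gestation_period(est_ga_weeks):
    """Assign a period from the GA (in weeks) estimated by the Normal-GA model."""
    if est_ga_weeks < 37:
        return "preterm"
    if est_ga_weeks <= 40:
        return "term_before_edd"  # >=37 and <=40 weeks
    return "after_edd"            # >40 weeks


# Periods in which each variant predicts residual GA.
ACTIVE_PERIODS = {
    "Res-Comb-PR-GA": {"preterm"},
    "Res-Comb-PRTB-GA": {"preterm", "term_before_edd"},
    "Res-Comb-GA": {"preterm", "term_before_edd", "after_edd"},
}


def applies_residual_model(variant, est_ga_weeks):
    """Return True if the variant predicts residual GA for this sample."""
    return gestation_period(est_ga_weeks) in ACTIVE_PERIODS[variant]
```

A sample estimated at 38 weeks, for instance, would receive a residual prediction under Res-Comb-PRTB-GA and Res-Comb-GA but not under Res-Comb-PR-GA.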

GA res-cpg-comb prediction

Model evaluation

We considered that the first-iteration models underperformed on the validation set compared with other participants’ models using the 450k probes (Figure 2). In the training set, we also observed that those models did not consistently beat the top performer across phenotypic subgroups (Figure 3). Hence, the phenotype information might not be as useful as expected. Nevertheless, the performance of the Resfull-GA model was slightly more consistent than that of the Res-, Res-CR-, and Res-Seq-GA models, although those models performed better on the whole training set. From the first iteration, we learned that the accuracy in predicting the conditions matters, since Resfull-GA employed prediction models with a larger training size instead of imputation models with a smaller training size. The prediction model that used the true probabilities underperformed. However, we argue that this was because the binary (0/1) values provided too few degrees of freedom. This finding also indicated that the predicted probabilities lacked degrees of freedom.

The Res-CPG-GA model family mostly outperformed the first-iteration models, both in validation-set rank, which rose to the top-3 positions based on RMSE, and in subgroup-wise winning rates (Figure 2). The Res-CPG- and Res-CPG-rev-GA models for <37 and ≥37 weeks’ gestation consistently outperformed the Res-CPG-PR-GA model across the subgroups (Figure 3), although Res-CPG-PR-GA coincidentally outperformed both of them in the validation set (Figure 2). This finding contradicted our assumption about the overfitting potential (see Modeling strategies). It motivated us to increase the degrees of freedom of the phenotype predicted probabilities, resulting in the Res-Conds-GA model.

Overall, the Res-Conds-GA model family did not outperform the Res-CPG and Res-CPG-rev models (Figure 2). The winning rates across the subgroups were also similar between the two model families, but the Res-Conds-GA models finally won in a few subgroups in which the previous models consistently lost, e.g., diandric triploid and miscarriage (Figure 3). Conversely, the Res-Conds-GA models lost in 3 of the 16 datasets of origin, compared with only 1 dataset (GSE74738) for the Res-CPG and Res-CPG-rev models. The Res-Conds-GA models using the true and predicted probabilities were more accurate and robust than those using imputation models or no multiplication (Table 1). In the validation set, the Res-Conds-GA model using the predicted probabilities had a lower RMSE (1.0949) than Res-CPG-PR (1.1381), Res-CPG-rev (1.1867), and Res-CPG (1.2025), but did not beat the top performer (1.076).

Eventually, we combined the Res-CPG- and Res-Conds-GA modeling approaches. The Res-Comb-GA model won in almost all subgroups, except 1 dataset of origin (GSE228149) (Figure 4), and had almost the same RMSE as the top performer in the validation set (1.077 vs. 1.076) (Table 1). Both r values were 0.966, but the MAE of Res-Comb-GA was slightly higher than that of the top performer (0.888 vs. 0.863). Meanwhile, the Res-CPG-Comb-GA model had the poorest performance of all the multistage prediction models (Figure 3). This finding underlines the importance of phenotype-related information, even though the conditions were not predicted with perfect accuracy. Therefore, considering its robustness across subgroups, the Res-Comb-GA model has a higher chance of winning in the test set, particularly in sub-challenge 1. Hence, we stopped the iterative model development process.
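For reference, the three metrics compared above (RMSE, MAE, and Pearson's r) can be computed as in the following sketch. These are the standard definitions written in plain Python; the function names are illustrative and this is not the challenge's scoring code.

```python
import math


def _check(y_true, y_pred):
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")


def rmse(y_true, y_pred):
    """Root mean squared error."""
    _check(y_true, y_pred)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    _check(y_true, y_pred)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted values."""
    _check(y_true, y_pred)
    mt = sum(y_true) / len(y_true)
    mp = sum(y_pred) / len(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)
```

Because r is insensitive to a constant offset or scale in the predictions while RMSE and MAE are not, two models can share the same r and still differ in error, as seen for Res-Comb-GA and the top performer.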

Performance comparison

Table S9. Model evaluation
model metric avg lb ub current_best win sub code rank val
Normal-GA RMSE 1.859 1.851 1.866 1.076 No
Normal-GA MAE 1.142 1.137 1.148 0.863 No
Normal-GA r 0.967 0.967 0.968 0.966 Yes
Res-GA (true) RMSE 1.276 1.272 1.281 1.076 No
Res-GA (true) MAE 0.868 0.866 0.871 0.863 No
Res-GA (true) r 0.981 0.981 0.981 0.966 Yes
Res-GA (true)* RMSE 1.172 1.168 1.176 1.076 No
Res-GA (true)* MAE 0.782 0.780 0.785 0.863 Yes
Res-GA (true)* r 0.984 0.984 0.984 0.966 Yes
Res-GA RMSE 1.521 1.515 1.527 1.076 No
Res-GA MAE 1.027 1.024 1.030 0.863 No
Res-GA r 0.973 0.973 0.973 0.966 Yes
Res-GA* RMSE 0.775 0.772 0.777 1.076 Yes 3 clearcut 8 1.4369
Res-GA* MAE 0.478 0.476 0.479 0.863 Yes 3 clearcut 8 1.132
Res-GA* r 0.993 0.993 0.993 0.966 Yes 3 clearcut 8 0.9416
Res-CR-GA RMSE 1.537 1.531 1.543 1.076 No
Res-CR-GA MAE 1.026 1.023 1.029 0.863 No
Res-CR-GA r 0.972 0.972 0.973 0.966 Yes
Res-CR-GA* RMSE 0.782 0.780 0.784 1.076 Yes 1 testthewaters 7 1.3706
Res-CR-GA* MAE 0.488 0.486 0.489 0.863 Yes 1 testthewaters 7 1.0735
Res-CR-GA* r 0.993 0.993 0.993 0.966 Yes 1 testthewaters 7 0.9476
Resfull-GA RMSE 1.329 1.324 1.334 1.076 No
Resfull-GA MAE 0.909 0.906 0.911 0.863 No
Resfull-GA r 0.979 0.979 0.980 0.966 Yes
Resfull-GA* RMSE 0.946 0.941 0.950 1.076 Yes 2 isitthedarkhorse 6 1.3552
Resfull-GA* MAE 0.613 0.611 0.615 0.863 Yes 2 isitthedarkhorse 6 1.073
Resfull-GA* r 0.990 0.990 0.990 0.966 Yes 2 isitthedarkhorse 6 0.9505
Res-Seq-GA RMSE 1.642 1.634 1.650 1.076 No
Res-Seq-GA MAE 1.082 1.078 1.085 0.863 No
Res-Seq-GA r 0.969 0.968 0.969 0.966 Yes
Res-Seq-GA* RMSE 1.198 1.194 1.201 1.076 No
Res-Seq-GA* MAE 0.844 0.842 0.846 0.863 Yes
Res-Seq-GA* r 0.986 0.986 0.986 0.966 Yes
Res-CPG-PR-GA RMSE 1.069 1.062 1.076 1.076 No 4 stepback 3 1.1381
Res-CPG-PR-GA MAE 0.562 0.559 0.564 0.863 Yes 4 stepback 3 0.9149
Res-CPG-PR-GA r 0.987 0.987 0.987 0.966 Yes 4 stepback 3 0.9625
Res-CPG-GA RMSE 0.555 0.552 0.558 1.076 Yes 5 stepbackfurther 5 1.2025
Res-CPG-GA MAE 0.271 0.270 0.272 0.863 Yes 5 stepbackfurther 5 0.9549
Res-CPG-GA r 0.996 0.996 0.996 0.966 Yes 5 stepbackfurther 5 0.9645
Res-CPG-rev-GA RMSE 0.512 0.506 0.519 1.076 Yes 6 stepbackabit 4 1.1867
Res-CPG-rev-GA MAE 0.183 0.182 0.185 0.863 Yes 6 stepbackabit 4 0.9363
Res-CPG-rev-GA r 0.997 0.997 0.997 0.966 Yes 6 stepbackabit 4 0.963
Res-Conds-GA† RMSE 0.686 0.684 0.688 1.076 Yes
Res-Conds-GA† MAE 0.446 0.445 0.447 0.863 Yes
Res-Conds-GA† r 0.995 0.995 0.995 0.966 Yes
Res-Conds-GA*‡ RMSE 0.810 0.808 0.813 1.076 Yes
Res-Conds-GA*‡ MAE 0.561 0.559 0.563 0.863 Yes
Res-Conds-GA*‡ r 0.992 0.992 0.992 0.966 Yes
Res-Conds-GA§ RMSE 0.819 0.817 0.822 1.076 Yes 7 slidingdoors 2 1.0949
Res-Conds-GA§ MAE 0.551 0.550 0.553 0.863 Yes 7 slidingdoors 2 0.8843
Res-Conds-GA§ r 0.992 0.992 0.992 0.966 Yes 7 slidingdoors 2 0.9642
Res-Conds-GA¶ RMSE 0.903 0.901 0.905 1.076 Yes
Res-Conds-GA¶ MAE 0.655 0.653 0.657 0.863 Yes
Res-Conds-GA¶ r 0.991 0.990 0.991 0.966 Yes
Res-Comb-PR-GA RMSE 0.724 0.721 0.726 1.076 Yes
Res-Comb-PR-GA MAE 0.496 0.495 0.497 0.863 Yes
Res-Comb-PR-GA r 0.994 0.994 0.994 0.966 Yes
Res-Comb-PRTB-GA RMSE 0.604 0.602 0.605 1.076 Yes
Res-Comb-PRTB-GA MAE 0.425 0.424 0.426 0.863 Yes
Res-Comb-PRTB-GA r 0.996 0.996 0.996 0.966 Yes
Res-Comb-GA RMSE 0.568 0.566 0.569 1.076 Yes 8 pointofdivergence 1 1.0772
Res-Comb-GA MAE 0.389 0.388 0.390 0.863 Yes 8 pointofdivergence 1 0.8876
Res-Comb-GA r 0.996 0.996 0.996 0.966 Yes 8 pointofdivergence 1 0.9663
Res-CPG-Comb-GA† RMSE 1.777 1.769 1.784 1.076 No
Res-CPG-Comb-GA† MAE 1.116 1.110 1.121 0.863 No
Res-CPG-Comb-GA† r 0.971 0.971 0.971 0.966 Yes
Res-CPG-Comb-GA‡ RMSE 1.818 1.810 1.825 1.076 No
Res-CPG-Comb-GA‡ MAE 1.152 1.147 1.158 0.863 No
Res-CPG-Comb-GA‡ r 0.969 0.969 0.969 0.966 Yes
Res-CPG-Comb-GA§ RMSE 1.828 1.821 1.836 1.076 No
Res-CPG-Comb-GA§ MAE 1.151 1.145 1.156 0.863 No
Res-CPG-Comb-GA§ r 0.969 0.968 0.969 0.966 Yes
Res-CPG-Comb-GA¶ RMSE 1.837 1.830 1.844 1.076 No
Res-CPG-Comb-GA¶ MAE 1.171 1.166 1.177 0.863 No
Res-CPG-Comb-GA¶ r 0.968 0.968 0.969 0.966 Yes
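The win column in Table S9 appears to follow a rule based on the interval bounds rather than the average: for RMSE and MAE a model wins when its upper bound (ub) is below current_best, and for r it wins when its lower bound (lb) is above current_best. This rule is inferred from the rows of the table and is an assumption, not a documented procedure. A minimal sketch:

```python
HIGHER_IS_BETTER = {"r"}  # RMSE and MAE are errors, so lower is better


def beats_current_best(metric, lb, ub, current_best):
    """Return True if the whole interval [lb, ub] is better than current_best.

    Inferred rule (assumption): strict comparison against the least
    favourable bound of the interval.
    """
    if metric in HIGHER_IS_BETTER:
        return lb > current_best
    return ub < current_best
```

For example, Res-CPG-PR-GA has an average RMSE of 1.069, which is below the current best of 1.076, but its upper bound of 1.076 is not strictly below it, which agrees with the "No" in the table.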

[Per-model output omitted in this rendering. List elements: Normal-GA, Res-CPG-rev-GA, Res-Conds-GA§, Res-Comb-GA.]

Best model predictors

Model submission

Submission 1: GA res-comp-risk random forest (testthewaters)

Submission 2: GA res-full random forest (isitthedarkhorse)

Submission 3: GA residual random forest (clearcut)

Submission 4: GA res-cpg-pr elastic net (stepback)

Submission 5: GA res-cpg elastic net (stepbackfurther)

Submission 6: GA res-cpg-rev elastic net (stepbackabit)

Submission 7: GA res-conds-pred elastic net (slidingdoors)

Submission 8: GA res-comb-pred elastic net (pointofdivergence/imperfectmirror)